Review of "Introduction to clustering large and high-dimensional data" by J. Kogan

Author

  • Dieter Mitsche
Abstract

Roughly speaking, clustering is a data analysis task that groups a set of items into categories so that items within one category are similar and items in different categories are dissimilar, where "similar" and "dissimilar" depend on the chosen definition of distance between items. Although it has been known for many decades, clustering has recently gained considerable importance because of the exponential growth of digital libraries and the World Wide Web and the resulting need to find and extract information. Motivated by these Information Retrieval (IR) applications, which are usually characterized by large, sparse and high-dimensional data, “Introduction to Clustering Large and High-Dimensional Data” by J. Kogan is a textbook that tries to focus on a few clustering techniques that are very common in IR. In particular, it focuses on the k-means algorithm, which is by far the most popular one in IR, and on many of its variations, among them incremental k-means, spherical k-means, quadratic k-means, k-means with divergences and others.
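
As a rough illustration of the basic idea (not code from the book or the review), the batch k-means procedure can be sketched as follows. The sketch assumes squared Euclidean distance and random initial centroids; the names `kmeans` and `squared_distance` and the parameters `iterations` and `seed` are hypothetical choices made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Minimal batch k-means sketch (assumed setup, not the book's code).

    Returns one cluster index per point and the final centroids.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
```

On this toy input the three points near the origin end up in one cluster and the three points near (10, 10) in the other. Replacing `squared_distance` with a different dissimilarity measure is, loosely, where variants such as spherical k-means or k-means with divergences depart from this sketch.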

Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can significantly be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not have decent performance in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...

Acetylation of wood – A review

Wood is a porous, three-dimensional, hygroscopic, viscoelastic, anisotropic bio-polymer composite composed of an interconnecting matrix of cellulose, hemicelluloses and lignin with minor amounts of inorganic elements and organic extractives. Some, but not all, of the cell wall polymer hydroxyl groups are accessible to moisture and these accessible hydroxyls form hydrogen bonds with water. As the...

Application of modified balanced iterative reducing and clustering using hierarchies algorithm in parceling of brain performance using fMRI data

Introduction: Clustering of human brain is a very useful tool for diagnosis, treatment, and tracking of brain tumors. There are several methods in this category in order to do this. In this study, modified balanced iterative reducing and clustering using hierarchies (m-BIRCH) was introduced for brain activation clustering. This algorithm has an appropriate speed and good scalability in dealing ...

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...

Journal:
  • Computer Science Review

Volume 2

Publication date: 2008